Abstract

Discovering frequent episodes over event sequences is an important data mining task. In many applications, events constituting the data sequence arrive as a stream, at furious rates, and recent trends (or frequent episodes) can change and drift due to the dynamical nature of the underlying event generation process. The ability to detect and track such changing sets of frequent episodes can be valuable in many application scenarios. Current methods for frequent episode discovery are typically multipass algorithms, making them unsuitable in the streaming context. In this paper, we propose a new streaming algorithm for discovering frequent episodes over a window of recent events in the stream. Our algorithm processes events as they arrive, one batch at a time, while discovering the top frequent episodes over a window consisting of several batches in the immediate past. We derive approximation guarantees for our algorithm under the condition that frequent episodes are approximately well-separated from infrequent ones in every batch of the window. We present extensive experimental evaluations of our algorithm on both real and synthetic data. We also present empirical comparisons with baselines and adaptations of streaming algorithms from the itemset mining literature.

1 Introduction

The problem of discovering interesting patterns from large datasets has been well studied in the form of pattern classes such as itemsets, sequential patterns, and episodes with temporal constraints. However, most of these techniques deal with static datasets, over which multiple passes are performed.

In many domains like telecommunication and computer security, it is becoming increasingly difficult to store and process data at speeds comparable to their generation rate. A few minutes of call log data in a telecommunication network can easily run into millions of records. Such data are referred to as data streams [12]. A data stream is an unbounded sequence where new data points or events arrive continuously and often at very high rates. Many traditional data mining algorithms are rendered useless in this context as one cannot hope to store the entire data and then process it. Any method for data streams must thus operate under the constraints of limited memory and processing time. In addition, the data must be processed faster than it is being generated. In this paper, we investigate the problem of mining temporal patterns called episodes under these constraints; while we focus on discovering frequent episodes from event streams, our method is general and adaptable to any class of patterns that might be of interest over the given data.

Several applications where frequent episodes have been found to be useful share these streaming data characteristics. In neuroscience, multi-electrode arrays are being used as implants to control artificial prosthetics. These interfaces interpret commands from the brain and direct external devices. Identifying controlling signals from the brain is much like finding a needle in a haystack. Large volumes of data need to be processed in real time to be able to solve this problem. Similar situations exist in telecom and computer networks where the network traffic and call logs must be analyzed to detect attacks or fraudulent activity.

A few works exist in the current literature for determining frequent itemsets from a stream of transactions (e.g. see [16]). However, they are either computationally impractical, due to worst-case assumptions, or ineffective due to strong independence assumptions. We make no statistical assumptions on the stream, independence or otherwise. We develop the error characterization of our algorithms by identifying two key properties of the data, namely, maximum rate of change and top-$k$ separation. Our key algorithmic contribution is an adaptation of the border sets data structure to reuse work done in previous batches when computing frequent patterns of the current batch. This reduces the candidate generation effort from $F^2$ to $F F_{new}$ (where $F$ denotes the number of frequent patterns of a particular size in the previous batch, while $F_{new}$ denotes the number of newly frequent patterns of that same size in the current batch). Experimental work demonstrates the practicality of our algorithms, both in terms of accuracy of the returned frequent pattern sets and in terms of computational efficiency.

2 Preliminaries 

In the framework of frequent episodes [Sj, an event sequence is denoted as $\langle (e_1, \tau_1), \ldots, (e_n, \tau_n) \rangle$, where $(e_i, \tau_i)$ represents the $i$th event; $e_i$ is drawn from a finite alphabet $\mathcal{E}$ of symbols (called event-types) and $\tau_i$ denotes the time-stamp of the $i$th event, with $\tau_{i+1} \ge \tau_i$, $i = 1, \ldots, (n-1)$. An $\ell$-node episode $\alpha$ is defined by a triple $\alpha = (V_\alpha, <_\alpha, g_\alpha)$, where $V_\alpha = \{v_1, \ldots, v_\ell\}$ is a collection of $\ell$ nodes, $<_\alpha$ is a partial order over $V_\alpha$ and $g_\alpha : V_\alpha \to \mathcal{E}$ is a map that assigns an event-type $g_\alpha(v)$ to each node $v \in V_\alpha$. There are two special classes of episodes: when $<_\alpha$ is total, $\alpha$ is called a serial episode, and when it is empty, $\alpha$ is called a parallel episode. An occurrence of an episode $\alpha$ is a map $h : V_\alpha \to \{1, \ldots, n\}$ such that $e_{h(v)} = g_\alpha(v)$ for all $v \in V_\alpha$, and for all pairs of nodes $v, v' \in V_\alpha$ such that $v <_\alpha v'$ the map $h$ ensures that $\tau_{h(v)} < \tau_{h(v')}$. Two occurrences of an episode are non-overlapped [7] if no event corresponding to one appears in-between the events corresponding to the other. The maximum number of non-overlapped occurrences of an episode is defined as its frequency in the event sequence. The task in frequent episode discovery is to find all patterns whose frequency exceeds a user-defined threshold. Given a frequency threshold, Apriori-style level-wise algorithms [9], [1] can be used to obtain the frequent episodes in the event sequence. An important variant of this task is top-$k$ episode mining, where, rather than issue a frequency threshold to the mining algorithm, the user supplies the number of top frequent episodes that need to be discovered.
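To make the non-overlapped frequency measure concrete, the following minimal Python sketch counts occurrences of a serial episode in a single left-to-right pass. It assumes the events are already sorted by time-stamp with no ties, and it represents the sequence simply as a list of event types; the function name is hypothetical and is not taken from the paper. Restarting the match right after each completed occurrence is the usual greedy way to obtain non-overlapped occurrences of a serial episode.

```python
def count_nonoverlapped_serial(event_types, episode):
    """Count non-overlapped occurrences of a serial episode.

    event_types: event types in time order (time-stamps assumed distinct).
    episode: tuple of event types in the order they must occur.
    """
    count = 0
    pos = 0  # index of the next episode node we are waiting for
    for e in event_types:
        if e == episode[pos]:
            pos += 1
            if pos == len(episode):
                count += 1  # one full occurrence completed
                pos = 0     # restart, so occurrences never interleave
    return count


print(count_nonoverlapped_serial(list("ABCABC"), ("A", "B", "C")))    # 2
print(count_nonoverlapped_serial(list("ABACBCAB"), ("A", "B", "C")))  # 1
```

In the second call, the first occurrence ends at the first C, and the remaining events (B, C, A, B) do not contain another complete A, B, C.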

Definition 1 (Top-$k$ episodes of size $\ell$). The set of top-$k$ episodes of size $\ell$ is defined as the collection of all $\ell$-node episodes with frequency greater than or equal to the frequency $f_k$ of the $k$th most frequent $\ell$-node episode in the given event sequence.

Note that the number of top-$k$ $\ell$-node episodes can exceed $k$, although the number of $\ell$-node episodes with frequencies strictly greater than $f_k$ is at most $(k-1)$.
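The tie-handling in Definition 1 can be illustrated with a short Python sketch. The frequency table below is invented purely for illustration, and `top_k_with_ties` is a hypothetical helper name; the sketch only assumes that the frequencies of all $\ell$-node episodes are available in a dictionary.

```python
def top_k_with_ties(freq, k):
    """Return (f_k, set of episodes with frequency >= f_k)."""
    if len(freq) < k:
        raise ValueError("fewer than k episodes available")
    f_k = sorted(freq.values(), reverse=True)[k - 1]  # k-th highest frequency
    return f_k, {ep for ep, f in freq.items() if f >= f_k}


# Hypothetical frequencies of four 2-node episodes.
freq = {"AB": 9, "BC": 7, "CA": 7, "AC": 4}
f_k, top = top_k_with_ties(freq, 2)
print(f_k, sorted(top))
```

Here `f_k` is 7 and three episodes are returned for $k = 2$, because BC and CA tie at the $k$th frequency; only one episode (AB) lies strictly above $f_k$, consistent with the bound of $(k-1)$.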

3 Problem Statement 

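The data available (referred to as an event stream) is a potentially infinite sequence of events:

$$\mathcal{D} = \langle (e_1, \tau_1), (e_2, \tau_2), \ldots, (e_i, \tau_i), \ldots \rangle \tag{1}$$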


Our goal is to find all episodes that were frequent in the recent past and, to this end, we consider a sliding window model for the window of interest of the user.$^1$ In this model, the user wants to determine episodes that are frequent over a window of fixed size terminating at the current time-tick. As new events arrive in the stream, the user's window of interest shifts, and the data mining task is to next report the frequent episodes in the new window of interest.

Typically, the window of interest is very large and cannot be stored and processed in-memory. This straightaway precludes the use of standard multi-pass algorithms for frequent episode discovery over the window of interest. Events in the stream can be organized into batches such that at any given time only the new incoming batch needs to be stored and processed in memory. This is illustrated in Fig. 1. The current window of interest is denoted by $W_s$ and the most recent batch, $B_s$, consists of a sequence of events in $\mathcal{D}$ with times of occurrence, $\tau_i$, such that,

$$(s-1)T_b < \tau_i \le sT_b \tag{2}$$

where $T_b$ is the time-span of each batch and $s$ is the batch number ($s = 1, 2, \ldots$).$^2$ The frequency of an episode $\alpha$ in a batch $B_s$ is referred to as its batch frequency $f^s(\alpha)$. The current window of interest, $W_s$, consists of $m$ consecutive batches ending in batch $B_s$, i.e.

$$W_s = \langle B_{s-m+1}, B_{s-m+2}, \ldots, B_s \rangle \tag{3}$$

$^1$ Streaming patterns literature has also considered other models, such as the landmark and time-fading models [4], but we do not consider them in this paper.

Figure 1: A sliding window model for episode mining over event streams: $B_s$ is the most recent batch of events that arrived in the stream and $W_s$ is the window of interest over which the user wants to determine the set of frequent episodes.

Table 1: Window frequencies in Example 1

Episode        ABCD   MNOP   EFGH   WXYZ   IJKL   PQRS
Window Freq      35     34     25     24     23     19

Definition 2 (Window Frequency). The frequency of an episode $\alpha$ over window $W_s$, referred to as its window frequency and denoted by $f^{W_s}(\alpha)$, is defined as the sum of batch frequencies of $\alpha$ in $W_s$. Thus, if $f^j(\alpha)$ denotes the batch frequency of $\alpha$ in batch $B_j$, then the window frequency of $\alpha$ is given by $f^{W_s}(\alpha) = \sum_{B_j \in W_s} f^j(\alpha)$.

In summary, we are given an event stream ($\mathcal{D}$), a time-span for batches ($T_b$), the number of consecutive batches that constitute the current window of interest ($m$), the desired size of frequent episodes ($\ell$) and the desired number of most frequent episodes ($k$). We are now ready to formally state the problem of discovering top-$k$ episodes in an event stream.

Problem 1 (Streaming Top-$k$ Mining). For each new batch, $B_s$, of events in the stream, find all $\ell$-node episodes in the corresponding window of interest, $W_s$, whose window frequencies are greater than or equal to the window frequency, $f_k^{W_s}$, of the $k$th most frequent $\ell$-node episode in $W_s$.

In general, the top-$k$ episodes over a window may be quite different from the top-$k$ episodes in the individual batches constituting the window. This is illustrated through an example in Fig. 2.

$^2$ We assume that the number of events in any batch is bounded above and that we have sufficient memory to store and process all events that occur in a batch. For example, if time is integer-valued and if only one event occurs at any time-tick, then there are at most $T_b$ events in any batch.

Figure 2: Batch frequencies in Example 1.

Example 1 (Window Top-$k$ vs. Batch Top-$k$). Let $W$ be a window of four batches $B_1, \ldots, B_4$. The episodes in each batch with corresponding batch frequencies are listed in Fig. 2. The corresponding window frequencies (sum of each episode's batch frequencies) are listed in Table 1. The top-2 episodes in $B_1$ are (PQRS) and (WXYZ). Similarly, (EFGH) and (IJKL) are the top-2 episodes in $B_2$, and so on. (ABCD) and (MNOP) have the highest window frequencies but never appear in the top-2 of any batch; these episodes would 'fly below the radar' and go undetected if we considered only the top-2 episodes in every batch as candidates for the top-2 episodes over $W$. This example can be easily generalized to any number of batches and any $k$.
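The effect described in Example 1 can be reproduced with a small Python sketch. The batch frequencies below are invented for illustration and are not the values of Fig. 2 (which are not reproduced in the text); the episode names and the helper `top_k` are likewise hypothetical. The sketch sums batch frequencies into window frequencies (Definition 2) and compares the batchwise top-2 sets with the window top-2 set.

```python
from collections import Counter


def top_k(freq, k):
    """Episodes whose frequency is at least the k-th highest frequency."""
    f_k = sorted(freq.values(), reverse=True)[k - 1]
    return {ep for ep, f in freq.items() if f >= f_k}


# Hypothetical batch frequencies for a window of two batches.
batches = [
    {"P": 10, "Q": 9, "X": 8, "Y": 8, "R": 1, "S": 1},
    {"R": 10, "S": 9, "X": 8, "Y": 8, "P": 1, "Q": 1},
]

window = Counter()
for batch in batches:
    window.update(batch)  # adds the batch frequencies episode by episode

batch_tops = [top_k(batch, 2) for batch in batches]
window_top = top_k(window, 2)

print(batch_tops)   # batchwise top-2 sets: {P, Q} and {R, S}
print(window_top)   # window top-2 set: {X, Y}
```

X and Y are steady but never among the two most frequent episodes of a batch, so a scheme that tracked only batchwise top-2 episodes would miss both window top-2 episodes.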

Example 1 highlights the main challenge in the streaming top-$k$ mining problem: while we can only store and process the most recent batch of events in the window of interest, the batchwise top-$k$ episodes may not contain sufficient information about the top-$k$ over the entire window. It is obviously not possible to count and track all episodes (both frequent and infrequent) in every batch in the window, since the pattern space is typically very large. This brings us to the question of which episodes to select and track in every batch. How deep must we search within each batch for episodes that have potential to become top-$k$ over the window? In this paper, we develop the formalism to answer this question. We identify two important properties of the underlying event stream which determine the design and analysis of our algorithms. These are stated in Definitions 3 & 4 below.

Definition 3 (Maximum Rate of Change, $\Delta$). The maximum change in batch frequency of any episode, $\alpha$, across any pair of consecutive batches, $B_s$ and $B_{s+1}$, is bounded above by $\Delta\ (> 0)$, i.e.,

$$|f^{s+1}(\alpha) - f^s(\alpha)| \le \Delta, \tag{4}$$

and $\Delta$ is referred to as the maximum rate of change.

Intuitively, $\Delta$ controls the extent of change that we may see from one batch to the next. It is trivially bounded above by the maximum number of events arriving per batch, and in practice, it is in fact much smaller.

Definition 4 (Top-$k$ Separation of $(\varphi, \epsilon)$). A batch $B_s$ of events is said to have a top-$k$ separation of $(\varphi, \epsilon)$, $\varphi \ge 0$, $\epsilon \ge 0$, if there are no more than $(1+\epsilon)k$ episodes with batch frequency greater than or equal to $(f_k^s - \varphi\Delta)$, where $f_k^s$ denotes the batch frequency of the $k$th most-frequent episode in $B_s$ and $\Delta$ denotes the maximum rate of change as per Definition 3.

This is a measure of how well-separated the frequencies of the top-$k$ episodes are relative to the rest of the episodes. We expect to see roughly $k$ episodes with batch frequencies of at least $f_k^s$, and the separation can be considered to be high (or good) if $\epsilon$ can remain small even for relatively large $\varphi$. We observe that $\epsilon$ is a non-decreasing function of $\varphi$ and that top-$k$ separation is measured relative to the maximum rate of change $\Delta$. Also, top-$k$ separation of any given batch of events is characterized through not one but several pairs of $(\varphi, \epsilon)$, since $\varphi$ and $\epsilon$ are essentially functionally related: $\epsilon$ is typically close to zero for $\varphi = 0$ and $\epsilon$ is roughly the size of the entire class of episodes (minus $k$) for $\varphi \ge f_k^s/\Delta$.
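Both properties can be estimated empirically from batch frequency tables, as in the following Python sketch. It is an illustration under stated assumptions rather than part of the original method: the batch frequencies are invented, an episode absent from a batch is treated as having frequency zero, and the function names are hypothetical. For a given $\varphi$, `separation_epsilon` returns the smallest non-negative $\epsilon$ for which the batch has a top-$k$ separation of $(\varphi, \epsilon)$.

```python
def max_rate_of_change(batches):
    """Largest change of any episode's batch frequency across consecutive batches."""
    delta = 0
    for prev, curr in zip(batches, batches[1:]):
        for ep in set(prev) | set(curr):
            delta = max(delta, abs(curr.get(ep, 0) - prev.get(ep, 0)))
    return delta


def separation_epsilon(freq, k, phi, delta):
    """Smallest epsilon >= 0 such that at most (1 + epsilon) * k episodes
    have batch frequency >= f_k - phi * delta."""
    f_k = sorted(freq.values(), reverse=True)[k - 1]
    n = sum(1 for f in freq.values() if f >= f_k - phi * delta)
    return max(0.0, n / k - 1)


# Hypothetical batch frequencies for two consecutive batches.
batches = [
    {"AB": 20, "BC": 18, "CA": 11, "AC": 5},
    {"AB": 22, "BC": 15, "CA": 12, "AC": 6},
]
delta = max_rate_of_change(batches)
for phi in (1, 3, 5):
    print(phi, separation_epsilon(batches[0], 2, phi, delta))
```

With these numbers the estimated $\Delta$ is 3, and $\epsilon$ grows from 0.0 to 0.5 to 1.0 as $\varphi$ goes from 1 to 3 to 5, illustrating that $\epsilon$ is non-decreasing in $\varphi$.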
We now use the maximum rate of change property to design efficient streaming algorithms for top-$k$ episode mining and show that top-$k$ separation plays a pivotal role in determining the quality of approximation that our algorithms can achieve.

Lemma 1. Consider two consecutive batches, $B_s$ and $B_{s+1}$, with a maximum rate of change $\Delta$. The batch frequencies of the $k$th most-frequent episodes in the corresponding batches are related as follows:

$$|f_k^{s+1} - f_k^s| \le \Delta \tag{5}$$

Proof. There exist at least $k$ episodes in $B_s$ with batch frequency greater than or equal to $f_k^s$ (by definition). Hence, there exist at least $k$ episodes in $B_{s+1}$ with batch frequency greater than or equal to $(f_k^s - \Delta)$ (since the frequency of any episode can decrease by at most $\Delta$ going from $B_s$ to $B_{s+1}$). Hence we must have $f_k^{s+1} \ge (f_k^s - \Delta)$. Similarly, there can be at most $(k-1)$ episodes in $B_{s+1}$ with batch frequency strictly greater than $(f_k^s + \Delta)$. Hence we must also have $f_k^{s+1} \le (f_k^s + \Delta)$. □

Next we show that if the batch frequency of an episode is known relative to $f_k^s$ in the current batch $B_s$, we can bound its frequency in a later batch.

Lemma 2. Consider two batches, $B_s$ and $B_{s+r}$, $r \in \mathbb{Z}$, located $r$ batches away from each other. If $\Delta$ is the maximum rate of change (as per Definition 3) then the batch frequency of any episode $\alpha$ in $B_{s+r}$ must satisfy the following:

1. If $f^s(\alpha) \ge f_k^s$, then $f^{s+r}(\alpha) \ge f_k^{s+r} - 2|r|\Delta$

2. If $f^s(\alpha) < f_k^s$, then $f^{s+r}(\alpha) < f_k^{s+r} + 2|r|\Delta$

Proof. Since $\Delta$ is the maximum rate of change, we have $f^{s+r}(\alpha) \ge (f^s(\alpha) - |r|\Delta)$ and from Lemma 1 we have $f_k^{s+r} \le (f_k^s + |r|\Delta)$. Therefore, if $f^s(\alpha) \ge f_k^s$, then

$$f^{s+r}(\alpha) + |r|\Delta \ge f^s(\alpha) \ge f_k^s \ge f_k^{s+r} - |r|\Delta$$

which implies $f^{s+r}(\alpha) \ge f_k^{s+r} - 2|r|\Delta$. Similarly, if $f^s(\alpha) < f_k^s$, then

$$f^{s+r}(\alpha) - |r|\Delta \le f^s(\alpha) < f_k^s \le f_k^{s+r} + |r|\Delta$$

which implies $f^{s+r}(\alpha) < f_k^{s+r} + 2|r|\Delta$. □

Lemma 2 gives us a way to track episodes that have potential to be in the top-$k$ of future batches. This is an important property which our algorithm exploits, and we record it as a remark below.

Remark 1. The top-$k$ episodes of batch $B_{s+r}$, $r \in \mathbb{Z}$, must have batch frequencies of at least $(f_k^s - 2|r|\Delta)$ in batch $B_s$. Specifically, the top-$k$ episodes of $B_{s+1}$ must have batch frequencies of at least $(f_k^s - 2\Delta)$ in $B_s$.



Based on the maximum rate of change property we can derive a necessary condition for any episode to be top-$k$ over a window. The following theorem prescribes the minimum batch frequencies that an episode must satisfy if it is a top-$k$ episode over the window $W_s$.

Theorem 1 (Exact Top-$k$ over $W_s$). An episode, $\alpha$, can be a top-$k$ episode over window $W_s$ only if its batch frequencies satisfy $f^{s'}(\alpha) \ge (f_k^{s'} - 2(m-1)\Delta)\ \forall B_{s'} \in W_s$.

Proof. Consider an episode $\beta$ for which $f^{s'}(\beta) < (f_k^{s'} - 2(m-1)\Delta)$ in batch $B_{s'} \in W_s$. Let $\alpha$ be any top-$k$ episode of $B_{s'}$. In any other batch $B_p \in W_s$, we have

$$\begin{aligned} f^p(\alpha) &\ge f^{s'}(\alpha) - |p - s'|\Delta \\ &\ge f_k^{s'} - |p - s'|\Delta \end{aligned} \tag{6}$$

and

$$\begin{aligned} f^p(\beta) &\le f^{s'}(\beta) + |p - s'|\Delta \\ &< (f_k^{s'} - 2(m-1)\Delta) + |p - s'|\Delta \end{aligned} \tag{7}$$

Applying $|p - s'| \le (m-1)$ to the above, we get

$$f^p(\alpha) \ge f_k^{s'} - (m-1)\Delta > f^p(\beta) \tag{8}$$

This implies $f^{W_s}(\beta) < f^{W_s}(\alpha)$ for every top-$k$ episode $\alpha$ of $B_{s'}$. Since there are at least $k$ top-$k$ episodes in $B_{s'}$, $\beta$ cannot be a top-$k$ episode over the window $W_s$. □

Based on Theorem 1 we can have the following simple algorithm for obtaining the top-$k$ episodes over a window: use a traditional level-wise approach to find all episodes with a batch frequency of at least $(f_k^1 - 2(m-1)\Delta)$ in the first batch ($B_1$), simply accumulate their corresponding batch frequencies over all $m$ batches of $W_s$ and report the episodes with the $k$ highest window frequencies over $W_s$. This approach is guaranteed to give us the exact top-$k$ episodes over $W_s$. Further, in order to report the top-$k$ over the next sliding window $W_{s+1}$, we need to consider all episodes with batch frequency of at least $(f_k^2 - 2(m-1)\Delta)$ in the second batch and track them over all batches of $W_{s+1}$, and so on. Thus, an exact solution to Problem 1 would require running a level-wise episode mining algorithm in every batch, $B_s$, $s = 1, 2, \ldots$, with a frequency threshold of $(f_k^s - 2(m-1)\Delta)$.
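The exact procedure can be sketched in Python as follows. This is a minimal sketch and not the authors' implementation: it assumes each batch has already been summarized as a dictionary of batch frequencies over the same set of episodes (standing in for the level-wise miner), that `delta` is a valid maximum rate of change for the data, and, as a simplification, it applies the necessary condition of Theorem 1 in every batch of the window rather than only in the first. The function names and the data are hypothetical.

```python
def kth_frequency(freq, k):
    """Frequency of the k-th most frequent episode in a frequency table."""
    return sorted(freq.values(), reverse=True)[k - 1]


def exact_window_top_k(window_batches, k, delta):
    """Top-k episodes over a window, using the Theorem 1 batchwise threshold."""
    m = len(window_batches)
    candidates = None
    for freq in window_batches:
        threshold = kth_frequency(freq, k) - 2 * (m - 1) * delta
        passing = {ep for ep, f in freq.items() if f >= threshold}
        # An episode must pass the threshold in every batch of the window.
        candidates = passing if candidates is None else candidates & passing
    # Accumulate window frequencies only for the surviving candidates.
    totals = {ep: sum(b[ep] for b in window_batches) for ep in candidates}
    f_k_window = kth_frequency(totals, k)
    return {ep for ep, f in totals.items() if f >= f_k_window}


# Hypothetical window of m = 3 batches; frequencies change by at most 1.
window = [
    {"A": 10, "B": 9, "C": 8, "D": 2},
    {"A": 9, "B": 10, "C": 9, "D": 3},
    {"A": 8, "B": 9, "C": 10, "D": 2},
]
print(sorted(exact_window_top_k(window, k=2, delta=1)))
```

With $k = 2$ and $\Delta = 1$ the batchwise threshold is 5 in each batch, so D is never tracked; the window frequencies of A, B and C are 27, 28 and 27, and all three are returned because A and C tie at the second-highest window frequency.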

4.1 Class of $(v, k)$-Persistent Episodes

Theorem 1 characterizes the minimum batchwise computation needed in order to obtain the exact top-$k$ episodes over a sliding window. This is effective when $\Delta$ and $m$ are small (compared to $f_k^s$). However, the batchwise frequency thresholds can become very low in other settings, making the processing time per batch, as well as the number of episodes to track over the window, impractically high. To address this issue, we introduce a new class of episodes called $(v, k)$-persistent episodes which can be computed efficiently by employing higher batchwise thresholds. Further, we show that these episodes can be used to approximate the true top-$k$ episodes over the window, and the quality of approximation is characterized in terms of the top-$k$ separation property (cf. Definition 4).

Definition 5 ($(v, k)$-Persistent Episode). A pattern is said to be $(v, k)$-persistent over window $W_s$ if it is a top-$k$ episode in at least $v$ batches of $W_s$.

Problem 2 (Mining $(v, k)$-Persistent Episodes). For each new batch, $B_s$, of events in the stream, find all $\ell$-node $(v, k)$-persistent episodes in the corresponding window of interest, $W_s$.

Theorem 2. An episode, $\alpha$, can be $(v, k)$-persistent over the window $W_s$ only if its batch frequencies satisfy $f^{s'}(\alpha) \ge (f_k^{s'} - 2(m-v)\Delta)$ for every batch $B_{s'} \in W_s$.

Proof. Let $\alpha$ be $(v, k)$-persistent over $W_s$ and let $\mathcal{V}_\alpha$ denote the set of batches in $W_s$ in which $\alpha$ is in the top-$k$. For any $B_q \notin \mathcal{V}_\alpha$ there exists $B_{\hat{p}(q)} \in \mathcal{V}_\alpha$ that is nearest to $B_q$. Since $|\mathcal{V}_\alpha| \ge v$, we must have $|\hat{p}(q) - q| \le (m-v)$. Applying Lemma 2 we then get $f^q(\alpha) \ge f_k^q - 2(m-v)\Delta$ for all $B_q \notin \mathcal{V}_\alpha$. □

Theorem 2 gives us the necessary conditions for computing all $(v, k)$-persistent episodes over sliding windows in the stream. The batchwise threshold required for $(v, k)$-persistent episodes depends on the parameter $v$. For $v = 1$, the threshold coincides with the threshold for exact top-$k$ in Theorem 1. The threshold increases linearly with $v$ and is highest at $v = m$ (when the batchwise threshold is the same as the corresponding batchwise top-$k$ frequency).

The algorithm for discovering $(v, k)$-persistent episodes follows the same general lines as the one described earlier for exact top-$k$ mining, except that we now apply higher batchwise thresholds: for each new batch, $B_s$, entering the stream, use a standard level-wise episode mining algorithm to find all episodes with batch frequency of at least $(f_k^s - 2(m-v)\Delta)$. (We provide more details of our algorithm later in Sec. 4.2.) First, we investigate the quality of approximation of top-$k$ that $(v, k)$-persistent episodes offer and show that the number of errors is closely related to the degree of top-$k$ separation in the data.
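Definition 5 and the batchwise threshold of Theorem 2 can be illustrated with the Python sketch below. It assumes, for illustration only, that each batch is given as a complete dictionary of batch frequencies (a real implementation would see only the episodes above the batchwise threshold); the data and the function names are hypothetical.

```python
from collections import Counter


def top_k(freq, k):
    """Episodes whose frequency is at least the k-th highest frequency."""
    f_k = sorted(freq.values(), reverse=True)[k - 1]
    return {ep for ep, f in freq.items() if f >= f_k}


def persistent_episodes(window_batches, k, v):
    """Episodes that are top-k in at least v batches of the window."""
    hits = Counter()
    for freq in window_batches:
        hits.update(top_k(freq, k))  # each top-k episode gets one hit per batch
    return {ep for ep, c in hits.items() if c >= v}


def batch_threshold(f_k, m, v, delta):
    """Batchwise frequency threshold of Theorem 2: f_k - 2 (m - v) delta."""
    return f_k - 2 * (m - v) * delta


window = [
    {"A": 10, "B": 9, "C": 8, "D": 2},
    {"A": 9, "B": 10, "C": 9, "D": 3},
    {"A": 8, "B": 9, "C": 10, "D": 2},
]
for v in (1, 2, 3):
    print(v, sorted(persistent_episodes(window, k=2, v=v)),
          batch_threshold(9, m=3, v=v, delta=1))
```

In this toy window B is in the top-2 of all three batches, while A and C are each in the top-2 of two batches, so the persistent set shrinks from {A, B, C} to {B} as $v$ goes from 2 to 3. The threshold rises from 5 at $v = 1$ (the exact top-$k$ threshold) to 9 at $v = m$ (the batchwise top-$k$ frequency).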
4.1.1 Top-$k$ Approximation

The main idea here is that, under a maximum rate of change $\Delta$ and a top-$k$ separation of $(\varphi, \epsilon)$, there cannot be too many distinct episodes which are not $(v, k)$-persistent, while having sufficiently high window frequencies. To this end, we first compute a lower-bound ($f_L$) on the window frequencies of $(v, k)$-persistent episodes and an upper-bound ($f_U$) on the window frequencies of episodes that are not $(v, k)$-persistent (cf. Lemmas 3 & 4).

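Lemma 3. If episode $\alpha$ is $(v, k)$-persistent over a window, $W_s$, then its window frequency, $f^{W_s}(\alpha)$, must satisfy the following lower-bound:

$$f^{W_s}(\alpha) \ge \sum_{B_{s'} \in W_s} f_k^{s'} - (m-v)(m-v+1)\Delta \overset{\mathrm{def}}{=} f_L \tag{9}$$

Proof. Consider episode $\alpha$ that is $(v, k)$-persistent over $W_s$ and let $\mathcal{V}_\alpha$ denote the batches of $W_s$ in which $\alpha$ is in the top-$k$. The window frequency of $\alpha$ can be written as

$$\begin{aligned} f^{W_s}(\alpha) &= \sum_{B_p \in \mathcal{V}_\alpha} f^p(\alpha) + \sum_{B_q \in W_s \setminus \mathcal{V}_\alpha} f^q(\alpha) \\ &\ge \sum_{B_p \in \mathcal{V}_\alpha} f_k^p + \sum_{B_q \in W_s \setminus \mathcal{V}_\alpha} \left( f_k^q - 2|\hat{p}(q) - q|\Delta \right) \\ &= \sum_{B_{s'} \in W_s} f_k^{s'} - \sum_{B_q \in W_s \setminus \mathcal{V}_\alpha} 2|\hat{p}(q) - q|\Delta \end{aligned} \tag{10}$$

where $B_{\hat{p}(q)} \in \mathcal{V}_\alpha$ denotes the batch nearest $B_q$ where $\alpha$ is in the top-$k$. Since $|W_s \setminus \mathcal{V}_\alpha| \le (m-v)$, we must have

$$\sum_{B_q \in W_s \setminus \mathcal{V}_\alpha} |\hat{p}(q) - q| \le (1 + 2 + \cdots + (m-v)) = \frac{1}{2}(m-v)(m-v+1) \tag{11}$$

Putting together (10) and (11) gives us the lemma. □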



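Lemma 4. If episode $\beta$ is not $(v, k)$-persistent over a window, $W_s$, then its window frequency, $f^{W_s}(\beta)$, must satisfy the following upper-bound:

$$f^{W_s}(\beta) < \sum_{B_{s'} \in W_s} f_k^{s'} + v(v+1)\Delta \overset{\mathrm{def}}{=} f_U \tag{12}$$

Proof. Consider episode $\beta$ that is not $(v, k)$-persistent over $W_s$ and let $\mathcal{V}_\beta$ denote the batches of $W_s$ in which $\beta$ is in the top-$k$. The window frequency of $\beta$ can be written as:

$$\begin{aligned} f^{W_s}(\beta) &= \sum_{B_p \in \mathcal{V}_\beta} f^p(\beta) + \sum_{B_q \in W_s \setminus \mathcal{V}_\beta} f^q(\beta) \\ &< \sum_{B_p \in \mathcal{V}_\beta} \left( f_k^p + 2|\hat{q}(p) - p|\Delta \right) + \sum_{B_q \in W_s \setminus \mathcal{V}_\beta} f_k^q \\ &= \sum_{B_{s'} \in W_s} f_k^{s'} + \sum_{B_p \in \mathcal{V}_\beta} 2|\hat{q}(p) - p|\Delta \end{aligned} \tag{13}$$

where $B_{\hat{q}(p)} \in W_s \setminus \mathcal{V}_\beta$ denotes the batch nearest $B_p$ where $\beta$ is not in the top-$k$. Since $|\mathcal{V}_\beta| < v$, we must have

$$\sum_{B_p \in \mathcal{V}_\beta} |\hat{q}(p) - p| \le (1 + 2 + \cdots + (v-1)) \le \frac{1}{2}v(v+1) \tag{14}$$

Putting together (13) and (14) gives us the lemma. □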



It turns out that $f_U > f_L\ \forall v$, $1 \le v \le m$, and hence there is always a possibility for some episodes which are not $(v, k)$-persistent to end up with higher window frequencies than one or more $(v, k)$-persistent episodes. We observed a specific instance of this kind of 'mixing' in our motivating example as well (cf. Example 1). This brings us to the top-$k$ separation property that we introduced in Definition 4. Intuitively, if there is sufficient separation of the top-$k$ episodes from the rest of the episodes in every batch, then we would expect to see very little mixing. As we shall see, this separation need not occur exactly at the $k$th most-frequent episode in every batch; somewhere close to it is sufficient to achieve a good top-$k$ approximation.
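The two bounds are easy to evaluate numerically. The Python sketch below computes $f_L$ and $f_U$ from the per-batch values of $f_k^{s'}$, assuming the forms $f_L = \sum f_k^{s'} - (m-v)(m-v+1)\Delta$ and $f_U = \sum f_k^{s'} + v(v+1)\Delta$ of Lemmas 3 and 4; the input values and the function name are hypothetical.

```python
def window_bounds(f_k_per_batch, v, delta):
    """Return (f_L, f_U) for a window given the k-th batch frequency of each batch.

    f_L lower-bounds the window frequency of (v, k)-persistent episodes;
    f_U upper-bounds the window frequency of non-persistent episodes.
    """
    m = len(f_k_per_batch)
    total = sum(f_k_per_batch)
    f_lower = total - (m - v) * (m - v + 1) * delta
    f_upper = total + v * (v + 1) * delta
    return f_lower, f_upper


# Hypothetical window: m = 3 batches, each with f_k = 9, and delta = 1.
for v in (1, 2, 3):
    print(v, window_bounds([9, 9, 9], v, delta=1))
```

For every $v$ in this example $f_U$ exceeds $f_L$ (for instance 33 versus 25 at $v = 2$), which is the gap in which non-persistent episodes can mix with persistent ones.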

Definition 6 (Band Gap Episodes, $\mathcal{G}_\varphi$). In any batch $B_{s'} \in W_s$, the half-open frequency interval $[f_k^{s'} - \varphi\Delta, f_k^{s'})$ is called the band gap of $B_{s'}$. The corresponding set, $\mathcal{G}_\varphi$, of band gap episodes over the window $W_s$, is defined as the collection of all episodes with batch frequencies in the band gap of at least one $B_{s'} \in W_s$.

The main feature of $\mathcal{G}_\varphi$ is that, if $\varphi$ is large enough, then the only episodes which are not $(v, k)$-persistent but that can still mix with $(v, k)$-persistent episodes are those belonging to $\mathcal{G}_\varphi$. This is stated formally in the next lemma.

Lemma 5. If $\frac{\varphi}{2} > \max\{1, (1 - \frac{v}{m})(m-v+1)\}$, then any episode $\beta$ that is not $(v, k)$-persistent over $W_s$ can have $f^{W_s}(\beta) \ge f_L$ only if $\beta \in \mathcal{G}_\varphi$.

Proof. If an episode $\beta$ is not $(v, k)$-persistent over $W_s$ then there exists a batch $B_{s'} \in W_s$ where $\beta$ is not in the top-$k$. Further, if $\beta \notin \mathcal{G}_\varphi$ then we must have $f^{s'}(\beta) < f_k^{s'} - \varphi\Delta$. Since $\varphi > 2$, $\beta$ cannot be in the top-$k$ of any neighboring batch of $B_{s'}$, and hence, it will stay below $f_k^{s'} - \varphi\Delta$ for all $B_{s'} \in W_s$, i.e.,

$$f^{W_s}(\beta) < \sum_{B_{s'} \in W_s} f_k^{s'} - m\varphi\Delta$$

The Lemma follows from the given condition $\frac{\varphi}{2} > (1 - \frac{v}{m})(m-v+1)$. □

The number of episodes in $\mathcal{G}_\varphi$ is controlled by the top-$k$ separation property, and since many of the non-persistent episodes which can mix with persistent ones must spend not one, but several batches in the band gap, the number of unique episodes that can cause such errors is bounded. Theorem 3 is our main result about the quality of top-$k$ approximation that $(v, k)$-persistence can achieve.

Theorem 3 (Quality of Top-$k$ Approximation). Let every batch $B_{s'} \in W_s$ have a top-$k$ separation of $(\varphi, \epsilon)$ with $\frac{\varphi}{2} > \max\{1, (1 - \frac{v}{m})(m-v+1)\}$. Let $\mathcal{P}$ denote the set of all $(v, k)$-persistent episodes over $W_s$. If $|\mathcal{P}| \ge k$, then the top-$k$ episodes over $W_s$ can be determined from $\mathcal{P}$ with an error of no more than $\left(\frac{\epsilon k m}{\mu}\right)$ episodes, where $\mu = \min\{m-v+1, \frac{\varphi}{2}, \frac{1}{2}(\sqrt{1 + 2m\varphi} - 1)\}$.



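Proof. By top-$k$ separation, we have a maximum of $(1+\epsilon)k$ episodes in any batch $B_{s'} \in W_s$ with batch frequencies greater than or equal to $f_k^{s'} - \varphi\Delta$. Since at least $k$ of these must belong to the top-$k$ of $B_{s'}$, there are no more than $\epsilon k$ episodes that can belong to the band gap of $B_{s'}$. Thus, there can be no more than a total of $\epsilon k m$ episodes over all $m$ batches of $W_s$ that can belong to $\mathcal{G}_\varphi$.

Consider any $\beta \notin \mathcal{P}$ with $f^{W_s}(\beta) \ge f_L$; these are the only episodes whose window frequencies can exceed that of any $\alpha \in \mathcal{P}$ (since $f_L$ is the minimum window frequency of any $\alpha$). If $\mu$ denotes the minimum number of batches in which $\beta$ belongs to the band gap, then there can be at most $\left(\frac{\epsilon k m}{\mu}\right)$ such distinct $\beta$. Thus, if $|\mathcal{P}| \ge k$, we can determine the set of top-$k$ episodes over $W_s$ with error no more than $\left(\frac{\epsilon k m}{\mu}\right)$ episodes.

There are now two cases to consider to determine $\mu$: (i) $\beta$ is in the top-$k$ of some batch, and (ii) $\beta$ is not in the top-$k$ of any batch.

Case (i): Let $\beta$ be in the top-$k$ of $B_{s'} \in W_s$. Let $B_{s''} \in W_s$ be $t$ batches away from $B_{s'}$. Using Lemma 2 we get $f^{s''}(\beta) \ge f_k^{s''} - 2t\Delta$. The minimum $t$ for which $(f_k^{s''} - 2t\Delta < f_k^{s''} - \varphi\Delta)$ is $(\frac{\varphi}{2})$. Since $\beta \notin \mathcal{P}$, $\beta$ is below the top-$k$ in at least $(m-v+1)$ batches. Hence $\beta$ stays in the band gap of at least $\min\{m-v+1, \frac{\varphi}{2}\}$ batches of $W_s$.

Case (ii): Let $\mathcal{V}_G$ denote the set of batches in $W_s$ where $\beta$ lies in the band gap and let $|\mathcal{V}_G| = g$. Since $\beta$ does not belong to the top-$k$ of any batch, it must stay below the band gap in all the $(m-g)$ batches of $(W_s \setminus \mathcal{V}_G)$. Since $\Delta$ is the maximum rate of change, the window frequency of $\beta$ can be written as follows:

$$\begin{aligned} f^{W_s}(\beta) &= \sum_{B_p \in \mathcal{V}_G} f^p(\beta) + \sum_{B_q \in W_s \setminus \mathcal{V}_G} f^q(\beta) \\ &< \sum_{B_p \in \mathcal{V}_G} f^p(\beta) + \sum_{B_q \in W_s \setminus \mathcal{V}_G} (f_k^q - \varphi\Delta) \end{aligned} \tag{15}$$

Let $B_{\hat{q}(p)}$ denote the batch in $W_s \setminus \mathcal{V}_G$ that is nearest to $B_p \in \mathcal{V}_G$. Then we have:

$$\begin{aligned} f^p(\beta) &\le f^{\hat{q}(p)}(\beta) + |p - \hat{q}(p)|\Delta \\ &< f_k^{\hat{q}(p)} - \varphi\Delta + |p - \hat{q}(p)|\Delta \\ &\le f_k^p - \varphi\Delta + 2|p - \hat{q}(p)|\Delta \end{aligned} \tag{16}$$

where the second inequality holds because $\beta$ is below the band gap in $B_{\hat{q}(p)}$ and (16) follows from Lemma 1. Using (16) in (15) we get

$$\begin{aligned} f^{W_s}(\beta) &< \sum_{B_{s'} \in W_s} f_k^{s'} - m\varphi\Delta + \sum_{B_p \in \mathcal{V}_G} 2|p - \hat{q}(p)|\Delta \\ &\le \sum_{B_{s'} \in W_s} f_k^{s'} - m\varphi\Delta + 2(1 + 2 + \cdots + g)\Delta \\ &= \sum_{B_{s'} \in W_s} f_k^{s'} - m\varphi\Delta + g(g+1)\Delta = \mathrm{UB} \end{aligned} \tag{17}$$

The smallest $g$ for which $(f^{W_s}(\beta) \ge f_L)$ is feasible can be obtained by setting $\mathrm{UB} \ge f_L$. Since $\frac{\varphi}{2} > (1 - \frac{v}{m})(m-v+1)$, $\mathrm{UB} \ge f_L$ implies

$$\sum_{B_{s'} \in W_s} f_k^{s'} - m\varphi\Delta + g(g+1)\Delta > \sum_{B_{s'} \in W_s} f_k^{s'} - \frac{m\varphi}{2}\Delta$$

Solving for $g$, we get $g \ge \frac{1}{2}(\sqrt{1 + 2m\varphi} - 1)$. Combining cases (i) and (ii), we get $\mu = \min\{m-v+1, \frac{\varphi}{2}, \frac{1}{2}(\sqrt{1 + 2m\varphi} - 1)\}$. □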

Theorem 3 shows the relationship between the extent of top-$k$ separation required and the quality of top-$k$ approximation that can be obtained through $(v, k)$-persistent episodes. In general, $\mu$ increases with $\frac{\varphi}{2}$ until the latter starts to dominate the other two factors, namely, $(m-v+1)$ and $\frac{1}{2}(\sqrt{1 + 2m\varphi} - 1)$. The theorem also brings out the tension between the persistence parameter $v$ and the quality of approximation. At smaller values of $v$, the algorithm mines 'deeper' within each batch and so we expect fewer errors with respect to the true top-$k$ episodes. On the other hand, deeper mining within batches is computationally more intensive, with the required effort approaching that of exact top-$k$ mining as $v$ approaches 1. Finally, we use Theorem 3 to derive error-bounds for three special cases: first for $v = 1$, when the batchwise threshold is the same as that for exact top-$k$ mining as per Theorem 1; second for $v = m$, when the batchwise threshold is simply the batch frequency of the $k$th most-frequent episode in the batch; and third, for $v = \lfloor \frac{m+1}{2} \rfloor$, when the batchwise threshold lies midway between the thresholds of the first two cases.
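The error bound of Theorem 3 can be evaluated directly, as in the Python sketch below. It is only a calculator for the stated formula, assuming $\mu = \min\{m-v+1, \varphi/2, \frac{1}{2}(\sqrt{1+2m\varphi}-1)\}$ and the bound $\epsilon k m / \mu$; the parameter values are hypothetical and the function name is not from the paper.

```python
import math


def error_bound(m, v, k, phi, eps):
    """Return (mu, eps * k * m / mu) as in Theorem 3.

    Raises ValueError when the separation condition of the theorem,
    phi / 2 > max{1, (1 - v/m)(m - v + 1)}, does not hold.
    """
    if phi / 2 <= max(1, (1 - v / m) * (m - v + 1)):
        raise ValueError("separation condition of Theorem 3 not met")
    mu = min(m - v + 1, phi / 2, 0.5 * (math.sqrt(1 + 2 * m * phi) - 1))
    return mu, eps * k * m / mu


# Hypothetical setting: window of m = 4 batches, k = 10, epsilon = 0.1.
print(error_bound(m=4, v=4, k=10, phi=3, eps=0.1))   # v = m
print(error_bound(m=4, v=1, k=10, phi=7, eps=0.1))   # v = 1
```

For $v = m$ the first factor gives $\mu = 1$, so the bound is $\epsilon k m$ (4 episodes here); for $v = 1$ a larger $\varphi$ is required by the condition, $\mu$ is larger, and the bound is correspondingly tighter.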

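Corollary 1. Let every batch $B_{s'} \in W_s$ have a top-$k$ separation of $(\varphi, \epsilon)$ and let $W_s$ contain at least $m \ge 2$ batches. Let $\mathcal{P}$ denote the set of all $(v, k)$-persistent episodes over $W_s$. If we have $|\mathcal{P}| \ge k$, then the maximum number of errors in the top-$k$ episodes derived from $\mathcal{P}$, for three different choices of $v$, is given by:

1. $\left(\frac{\epsilon k m}{m-1}\right)$, for $v = 1$, if $\frac{\varphi}{2} > (m-1)$

2. $(\epsilon k m)$, for $v = m$, if $\frac{\varphi}{2} > 1$

3. $\left(\frac{4 \epsilon k m^2}{m^2-1}\right)$, for $v = \lfloor \frac{m+1}{2} \rfloor$, if $\frac{\varphi}{2} > \frac{1}{m} \lceil \frac{m-1}{2} \rceil \lceil \frac{m+1}{2} \rceil$

Proof. We show the proof only for $v = \lfloor \frac{m+1}{2} \rfloor$. The cases of $v = 1$ and $v = m$ are obtained immediately upon application of Theorem 3.

Fixing $v = \lfloor \frac{m+1}{2} \rfloor$ implies $(m-v) = \lceil \frac{m-1}{2} \rceil$. For $m \ge 2$, $\frac{\varphi}{2} > \frac{1}{m} \lceil \frac{m-1}{2} \rceil \lceil \frac{m+1}{2} \rceil$ implies $\frac{\varphi}{2} > \max\{1, (1 - \frac{v}{m})(m-v+1)\}$. Let $t_{\min} = \min\{m-v+1, \frac{\varphi}{2}\}$. The minimum value of $t_{\min}$ is governed by

$$t_{\min} \ge \min\left\{ \left\lceil \frac{m+1}{2} \right\rceil, \frac{1}{m} \left\lceil \frac{m-1}{2} \right\rceil \left\lceil \frac{m+1}{2} \right\rceil \right\} \ge \frac{1}{m} \left( \frac{m-1}{2} \right) \left( \frac{m+1}{2} \right) = \frac{m^2-1}{4m} \tag{18}$$

Let $g_{\min} = \frac{1}{2}(\sqrt{1 + 2m\varphi} - 1)$. $\frac{\varphi}{2} > \frac{1}{m} \lceil \frac{m-1}{2} \rceil \lceil \frac{m+1}{2} \rceil$ implies $g_{\min} > \frac{m-1}{2}$. From Theorem 3 we have

$$\mu = \min\{t_{\min}, g_{\min}\} \ge \frac{m^2-1}{4m}$$

and hence the number of errors is no more than $\left(\frac{4 \epsilon k m^2}{m^2-1}\right)$. □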

4.2 Incremental Algorithm 

In this section we present an efficient algorithm for incrementally mining patterns with frequency $\ge (f_k^s - \theta)$. From our formalism, the value of $\theta$ is specified by the type of patterns we want to mine. For $(v, k)$-persistence, $\theta = 2(m-v)\Delta$, whereas for mining the exact top-$k$ the value is $2(m-1)\Delta$.



Figure 3: The set of frequent patterns can be incrementally updated as new batches arrive. (Diagram labels: after a new batch arrives; old episodes no longer frequent; new frequent episodes.)

Recall that the goal of our mining task is to report frequent patterns of size $\ell$. After processing the data in the batch $B_{s-1}$, we desire all patterns with frequency greater than $(f_k^{s-1} - \theta)$. Algorithmically this is achieved by first setting a high frequency threshold and mining for patterns using the classical level-wise Apriori method [2]. If the number of patterns of size $\ell$ is less than $k$, the support threshold is decreased and the mining repeated until at least $k$ patterns of size $\ell$ are found. At this point $f_k^s$ is known. The mining process is repeated once more with the frequency threshold $(f_k^s - \theta)$. Doing this entire procedure for every new batch can be expensive and wasteful. After seeing the first batch of the data, whenever a new batch arrives we have information about the patterns that were frequent in the previous batch. This can be exploited to incrementally and efficiently update the set of frequent episodes in the new batch. The intuition behind this is that the frequencies of the majority of episodes do not change much from one batch to the next. As a result, only a small number of episodes fall below the new support threshold in the new batch. There is also the possibility of some new episodes becoming frequent. This is illustrated in Figure 3. In order to efficiently find these sets of episodes, we need to maintain additional information that allows us to avoid full-blown candidate generation. We show that this state information is a by-product of the Apriori algorithm, so no extra processing is needed to obtain it.

In the Apriori algorithm, frequent patterns are discovered iteratively, in ascending order of their size, which
is why it is often referred to as a levelwise procedure. The procedure alternates between counting and candidate
generation. First a set $C^i$ of candidate $i$-size patterns is created by joining the frequent $(i-1)$-size itemsets
found in the previous iteration. Then the data is scanned to determine the frequency or count of each
candidate pattern, and the frequent $i$-size patterns are extracted from the candidates. An interesting
observation is that all candidate episodes that are not frequent constitute the negative border of the frequent
lattice. This is true because, in the Apriori algorithm, a candidate pattern is generated only when all its
subpatterns are frequent. The usual approach is to discard the border. For our purposes, the patterns in
the border contain the information required to identify the change in the frequent sets from one batch to
the next.
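The levelwise procedure with a retained negative border can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: itemsets over a list of transactions stand in for episodes over an event sequence, ties at the threshold are kept, and the names `support` and `mine_with_border` are hypothetical.

```python
from itertools import combinations

def support(pattern, batch):
    """Number of transactions in the batch that contain every item of the pattern."""
    return sum(1 for transaction in batch if pattern <= transaction)

def mine_with_border(batch, f_min, max_size):
    """Levelwise (Apriori-style) mining that keeps the negative border.

    Returns two dicts mapping pattern size -> set of frozensets:
    the frequent patterns and the border (counted but infrequent candidates).
    """
    frequent, border = {}, {}
    # Level-1 candidates: every item seen in the batch.
    candidates = {frozenset([item]) for transaction in batch for item in transaction}
    for size in range(1, max_size + 1):
        frequent[size] = {c for c in candidates if support(c, batch) >= f_min}
        # Candidates that were counted but are not frequent form the border.
        border[size] = candidates - frequent[size]
        # Join frequent patterns; keep a candidate only if all its subpatterns are frequent.
        candidates = set()
        for a, b in combinations(frequent[size], 2):
            joined = a | b
            if len(joined) == size + 1 and all(
                frozenset(sub) in frequent[size] for sub in combinations(joined, size)
            ):
                candidates.add(joined)
    return frequent, border

# Toy batch (assumed data, for illustration only).
batch = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}, {"D"}]
frequent, border = mine_with_border(batch, f_min=3, max_size=3)
```

In this toy batch the pattern {A, B, C} is generated as a candidate because all of its 2-size subpatterns are frequent, but it occurs only twice, so it ends up in the border rather than being discarded.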

The pseudocode for incrementally mining frequent patterns in batches is listed in Algorithm 1. Let the
frequent episodes of size $i$ be denoted by $\mathcal{F}_s^i$. Similarly, the border episodes of size $i$ are denoted by $\mathcal{B}_s^i$. The
frequency threshold used in each batch is $f_k^s - \theta$. In the first batch of data, the top-$k$ patterns are found
by progressively lowering the frequency threshold $f_{\min}$ by a small amount $\epsilon$ (Lines 1-8). Once at least $k$
patterns of size $\ell$ are found, $f_k^s$ is determined and the mining procedure is repeated with a threshold of $f_k^s - \theta$.
The border patterns generated during levelwise mining are retained.

For subsequent batches, first $f_k^s$ is determined. As shown in Remark 1, if $\theta > 2\Delta$, then the set of frequent
patterns $\mathcal{F}_{s-1}^{\ell}$ in batch $B_{s-1}$ contains all patterns that can be frequent in the next batch $B_s$. Therefore
simply updating the counts of all patterns in $\mathcal{F}_{s-1}^{\ell}$ in the batch $B_s$ and picking the $k$-th highest frequency
gives $f_k^s$ (Lines 10-11). The new frequency threshold $f_{\min}$ is set to $f_k^s - \theta$. The procedure, starting from
the bottom (size-1 patterns), updates the lattice for $B_s$. The data is scanned to determine the frequency of new
candidates together with the frequent and border patterns from the lattice (Lines 15-18). In the first level
(patterns of size 1), the candidate set is empty. After counting, the patterns from the frequent set $\mathcal{F}_{s-1}^i$ that
continue to be frequent in the new batch are added to $\mathcal{F}_s^i$. But if a pattern is no longer frequent, it is marked
as a border pattern and all its super-episodes are deleted (Lines 19-24). This ensures that only border patterns
are retained in the lattice. All patterns, either from the border set or the new candidate set, that are found
to be frequent are added to $\mathcal{F}_s^i$. Such episodes are also added to $\mathcal{F}_{new}^i$. Any remaining infrequent patterns
belong to the border set, because otherwise they would have at least one infrequent subpattern and would
have been deleted at a previous level (Line 24). These patterns are added to $\mathcal{B}_s^i$ (Line 30). The candidate
generation step is required to fill out the missing parts of the frequent lattice. We want to avoid a
full-blown candidate generation. Note that if a pattern is frequent in $B_{s-1}$ and $B_s$ then all its subpatterns are
also frequent in both $B_s$ and $B_{s-1}$. Any new pattern ($\notin \mathcal{F}_{s-1}^{*} \cup \mathcal{B}_{s-1}^{*}$) that turns frequent in $B_s$, therefore,
must have at least one subpattern that was not frequent in $B_{s-1}$ but is frequent in $B_s$. All such patterns are
listed in $\mathcal{F}_{new}^i$. The candidate generation step (Line 31) for the next level generates only candidate patterns
with at least one subpattern $\in \mathcal{F}_{new}^i$. This greatly restricts the number of candidates generated at each level
without compromising the completeness of the results.
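The incremental update described above can be sketched in a few lines. The sketch makes several assumptions that are not in the paper: itemsets over transactions stand in for episodes, the alphabet is fixed across batches, ties at the threshold are kept, the loop runs through the largest pattern size, and all names (`update_lattice`, `generate_candidates`, the toy data) are hypothetical.

```python
from itertools import combinations

def support(pattern, batch):
    """Number of transactions in the batch that contain the pattern."""
    return sum(1 for transaction in batch if pattern <= transaction)

def generate_candidates(newly_frequent, frequent, size):
    """Candidates of size+1 that have at least one subpattern in newly_frequent."""
    candidates = set()
    for a in newly_frequent:
        for b in frequent:
            joined = a | b
            if len(joined) == size + 1 and all(
                frozenset(sub) in frequent for sub in combinations(joined, size)
            ):
                candidates.add(joined)
    return candidates

def update_lattice(prev_frequent, prev_border, batch, f_min, max_size):
    """Update the (frequent, border) lattice of the previous batch for a new batch."""
    # Work on copies so that deletions do not alter the caller's lattice.
    old_f = {i: set(prev_frequent.get(i, ())) for i in range(1, max_size + 1)}
    old_b = {i: set(prev_border.get(i, ())) for i in range(1, max_size + 1)}
    frequent, border = {}, {}
    candidates = set()  # there are no new candidates of size 1
    for i in range(1, max_size + 1):
        frequent[i], border[i], newly = set(), set(), set()
        for a in old_f[i]:
            if support(a, batch) >= f_min:
                frequent[i].add(a)
            else:
                border[i].add(a)
                # Drop every super-pattern of `a` from the old lattice.
                for j in range(i + 1, max_size + 1):
                    old_f[j] = {p for p in old_f[j] if not a < p}
                    old_b[j] = {p for p in old_b[j] if not a < p}
        for a in old_b[i] | candidates:
            if support(a, batch) >= f_min:
                frequent[i].add(a)
                newly.add(a)  # was not frequent before, is frequent now
            else:
                border[i].add(a)
        candidates = generate_candidates(newly, frequent[i], i)
    return frequent, border

fs = frozenset
# Lattice of the previous batch (assumed): ABC and its subsets frequent, X on the border.
prev_frequent = {1: {fs("A"), fs("B"), fs("C")}, 2: {fs("AB"), fs("AC"), fs("BC")}, 3: {fs("ABC")}}
prev_border = {1: {fs("X")}}
new_batch = [{"A", "C", "X"}, {"A", "C", "X"}, {"B", "C"}, {"B", "C", "X"}, {"A", "B"}]
frequent, border = update_lattice(prev_frequent, prev_border, new_batch, f_min=2, max_size=3)
```

In this toy run AB drops below the threshold, so it moves to the border and its super-pattern ABC is deleted without being counted. The border pattern X turns frequent, which is what allows AX, BX and CX, and then ACX, to be generated as candidates.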

The space and time complexity of the candidate generation is now $O(|\mathcal{F}_{new}^i| \cdot |\mathcal{F}_s^i|)$ instead of $O(|\mathcal{F}_s^i|^2)$,
and in most practical cases $|\mathcal{F}_{new}^i| \ll |\mathcal{F}_s^i|$. This is crucial in a streaming application where the processing rate
must match the data arrival rate.

Algorithm 1 Mine top-$k$ $v$-persistent patterns.

Input: A new batch of events $B_s$, the lattice of frequent and border patterns, and parameters $k$ and $\theta$
Output: The lattice of frequent and border patterns $(\mathcal{F}_s^*, \mathcal{B}_s^*)$

 1: if $s = 1$ then
 2:   $f_{\min}$ = high value
 3:   while $|\mathcal{F}_s^{\ell}| < k$ do
 4:     Mine patterns with frequency $> f_{\min}$
 5:     $f_{\min} = f_{\min} - \epsilon$
 6:   $f_k^s$ = frequency of the $k$-th most frequent pattern $\in \mathcal{F}_s^{\ell}$
 7:   Mine $\ell$-size patterns in $B_s$ with frequency threshold $f_{\min} = f_k^s - \theta$
 8:   Store the frequent and border patterns (of size $= 1 \ldots \ell$) in $(\mathcal{F}_s^*, \mathcal{B}_s^*)$
 9: else
10:   CountPatterns($\mathcal{F}_{s-1}^{\ell}$, $B_s$)
11:   Set $f_k^s$ = $k$-th highest frequency (pattern $\in \mathcal{F}_{s-1}^{\ell}$)
12:   Set frequency threshold for $B_s$, $f_{\min} = (f_k^s - \theta)$
13:   $C^1 = \emptyset$ {New candidate patterns of size = 1}
14:   for $i = 1 \ldots \ell - 1$ do
15:     $\mathcal{F}_s^i = \emptyset$ {Frequent patterns of size $i$}
16:     $\mathcal{B}_s^i = \emptyset$ {Border patterns of size $i$}
17:     $\mathcal{F}_{new}^i = \emptyset$ {List of newly frequent patterns}
18:     CountPatterns($\mathcal{F}_{s-1}^i \cup \mathcal{B}_{s-1}^i \cup C^i$, $B_s$)
19:     for $\alpha \in \mathcal{F}_{s-1}^i$ do
20:       if $f_s(\alpha) > f_{\min}$ then
21:         $\mathcal{F}_s^i = \mathcal{F}_s^i \cup \{\alpha\}$
22:       else
23:         $\mathcal{B}_s^i = \mathcal{B}_s^i \cup \{\alpha\}$
24:         Delete all its super-patterns from $(\mathcal{F}_{s-1}^*, \mathcal{B}_{s-1}^*)$
25:     for $\alpha \in \mathcal{B}_{s-1}^i \cup C^i$ do
26:       if $f_s(\alpha) > f_{\min}$ then
27:         $\mathcal{F}_s^i = \mathcal{F}_s^i \cup \{\alpha\}$
28:         $\mathcal{F}_{new}^i = \mathcal{F}_{new}^i \cup \{\alpha\}$
29:       else
30:         $\mathcal{B}_s^i = \mathcal{B}_s^i \cup \{\alpha\}$
31:     $C^{i+1}$ = GenerateCandidate$_{i+1}$($\mathcal{F}_{new}^i$, $\mathcal{F}_s^i$)
32: return $(\mathcal{F}_s^*, \mathcal{B}_s^*)$

For a window $W_s$ ending in the batch $B_s$, the set of output patterns can be obtained by picking the
top-$k$ most frequent patterns from the set $\mathcal{F}_s^{\ell}$. Each pattern also maintains a list that stores its batch-wise
counts in the last $m$ batches. The window frequency is obtained by adding these entries together. The output
patterns are listed in decreasing order of their window counts.
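The window-level bookkeeping can be sketched as below. This is an illustrative sketch only: it stores one dictionary of counts per batch rather than one list per pattern (the sums are the same), and the class name `WindowCounts` and the toy counts are assumptions.

```python
from collections import defaultdict, deque

class WindowCounts:
    """Keeps the batch-wise pattern counts of the last m batches."""

    def __init__(self, m):
        # deque(maxlen=m) drops the oldest batch automatically when a new one is added.
        self.batches = deque(maxlen=m)

    def add_batch(self, batch_counts):
        """batch_counts maps each tracked pattern to its count in the new batch."""
        self.batches.append(dict(batch_counts))

    def top_k(self, k):
        """Window frequency = sum of batch-wise counts; return the k largest."""
        totals = defaultdict(int)
        for counts in self.batches:
            for pattern, count in counts.items():
                totals[pattern] += count
        # Decreasing window count; ties broken by pattern name for determinism.
        ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
        return ranked[:k]

window = WindowCounts(m=2)
window.add_batch({"AB": 5, "CD": 3})
window.add_batch({"AB": 1, "CD": 4, "EF": 2})
window.add_batch({"AB": 2, "EF": 6})   # the first batch falls out of the window
top2 = window.top_k(2)
```

A pattern that was not tracked in some batch contributes zero for that batch, so the window count of such a pattern is a lower bound on its true window frequency.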

Example 2. In this example we illustrate the procedure for incrementally updating the frequent patterns
lattice as a new batch $B_s$ is processed (see Figure 4).

Figure 4(A) shows the lattice of frequent and border patterns found in the batch $B_{s-1}$. ABCD is a 4-size
frequent pattern in the lattice. In the new batch $B_s$, the pattern ABCD is no longer frequent. The pattern
CDXY appears as a new frequent pattern. The pattern lattice in $B_s$ is shown in Figure 4(B).

In the new batch $B_s$, AB falls out of the frequent set. AB now becomes the new border and all its
super-patterns, namely ABC, BCD and ABCD, are deleted from the lattice.

At level 2, the border pattern XY turns frequent in $B_s$. This allows us to generate DXY as a new 3-size
candidate. At level 3, DXY is also found to be frequent and is combined with CDX, which is also frequent
in $B_s$, to generate CDXY as a 4-size candidate. Finally at level 4, CDXY is found to be frequent. This
shows that border sets can be used to fill out the parts of the pattern lattice that become frequent in the new
data.

Figure 4: Incremental lattice update for the next batch $B_s$ given the lattice of frequent and border patterns
in $B_{s-1}$.

4.3 Estimating $\Delta$ dynamically

The parameter $\Delta$ in the bounded rate change assumption is a critical parameter in the entire formulation.
Unfortunately, the choice of the correct value for $\Delta$ is highly data-dependent. In the streaming setting,
the characteristics of the data can change over time. Hence a single predetermined value of $\Delta$ cannot be provided
in any intuitive way. Therefore we estimate $\Delta$ from the frequencies of $\ell$-size episodes in consecutive windows.
We compute the differences in frequencies of episodes that are common to consecutive batches. Specifically,
we consider the value at the 75th percentile as an estimate of $\Delta$. We avoid using the maximum change as it
tends to be noisy: a few patterns exhibiting large changes in frequency can skew the estimate and adversely
affect the mining procedure.
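A minimal sketch of this estimate is given below. The nearest-rank definition of the percentile is an assumption (the text does not say which percentile convention is used), and the function name `estimate_delta` and the toy counts are hypothetical.

```python
def estimate_delta(prev_counts, curr_counts, percentile=75):
    """Estimate the rate-change bound from two consecutive batches.

    prev_counts / curr_counts map each episode to its frequency in the
    previous / current batch. Only episodes present in both are compared.
    """
    common = prev_counts.keys() & curr_counts.keys()
    changes = sorted(abs(curr_counts[p] - prev_counts[p]) for p in common)
    if not changes:
        return 0
    # Nearest-rank percentile: rank = ceil(percentile * n / 100), at least 1.
    rank = max(1, (percentile * len(changes) + 99) // 100)
    return changes[rank - 1]

# Toy frequencies (assumed). Episode "h" changes by 40, far more than the rest.
prev = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50, "f": 60, "g": 70, "h": 80, "z": 5}
curr = {"a": 11, "b": 19, "c": 32, "d": 38, "e": 53, "f": 64, "g": 65, "h": 120, "y": 7}
delta = estimate_delta(prev, curr)
largest_change = estimate_delta(prev, curr, percentile=100)
```

In this toy data the single outlier "h" would push a maximum-based estimate to 40, whereas the 75th-percentile estimate stays at 4.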

5 Results 

In this section we present results both on synthetic data and on data from real neuroscience experiments. We
compare the performance of the proposed streaming episode mining algorithms on synthetic data to quantify
the effect of different parameter choices and data characteristics on the quality of the top-$k$ episodes reported
by each method. Finally we show the quality of results obtained on neuroscience data.

For the purpose of comparing the quality of results we set up the following six variants of frequent
episode mining:

Alg 0: This is the naive brute-force top-$k$ mining algorithm that loads an entire window of events at a time
and mines the top-$k$ episodes by repeatedly lowering the frequency threshold for mining. When a new
batch arrives, events from the oldest batch are retired and the mining process is repeated from scratch.
This method acts as the baseline for comparing all other algorithms in terms of precision and recall.

Alg 1: The top-$k$ mining is done batch-wise. The top-$k$ episodes over a window are reported from within
the set of episodes that belong to the batch-wise top-$k$ of at least one batch in the window.

Alg 2: Here the algorithm is the same as above, but once an episode enters the top-$k$ in any of the batches
in a window, it is tracked over several subsequent batches. An episode is removed from the list of
episodes being tracked if it does not occur in the top-$k$ of the last $m$ consecutive batches. This strategy
helps in obtaining a larger candidate set and also in getting more accurate counts of candidate patterns
over the window.

Alg 3: This algorithm uses a batch-wise frequency threshold $f_k^s - 2\delta$, which ensures that the top-$k$ episodes
in the next batch $B_{s+1}$ are contained in the frequent lattice of $B_s$. This avoids multiple passes over the
data while trying to obtain the $k$ most frequent episodes by lowering the support threshold iteratively. The
patterns with frequency between $f_k^s$ and $f_k^s - 2\delta$ also improve the overall precision and recall with
respect to the window.

Alg 4: In this case the batch-wise frequency threshold is $f_k^s - 2(m - v)\delta$, which guarantees finding all $(v, k)$-
persistent episodes in the data. We report results for $v = 3m/4$ and $v = m/2$.

Alg 5: Finally, this last algorithm uses a heuristic batchwise threshold of /| — m(2 — — — (^-) 2 )A. Again 
we report results for v = 3m/4 and v = m/2. 

5.1 Synthetic Datasets 

The datasets we used for experimental evaluation are listed in Table 2. The name of the data set is listed in
Column 1, the length of the data set (or number of time-slices in the data sequence) in Column 2, the size
of the alphabet (or total number of event types) in Column 3, the average rest firing rate in Column 4 and
the number of patterns embedded in Column 5. In these datasets the data length is varied from - million to
- million events, the alphabet size is varied from 500 to 5000, the resting firing rate from 2.0 to 25.0, and
the number of patterns embedded in the data from 10 to 50.

Table 2: Datasets

Dataset   Alphabet   Rest Firing   Number of
Name      Size       Rate (Hz)     Patterns
A1        500        10.0          50
A2        1000       10.0          50
A3        5000       10.0          50
B1        1000       2.0           50
B2        1000       10.0          50
B3        1000       25.0          50
C1        1000       10.0          10
C2        1000       10.0          25
C3        1000       10.0          50

Data generation model: The data generation model for synthetic data is based on the inhomogeneous
Poisson process model used for evaluating the algorithm for learning excitatory dynamic networks [13]. We
introduce two changes to this model. First, in order to mimic real data more closely, the event types of the
events that constitute the background noise follow a power-law distribution. This gives the simulated data
its long-tail characteristics.

The second modification is to allow the rate of arrival of episodes to change over time. As time
progresses, the frequency of episodes in the recent window or batch slowly changes. We use a randomized
scheme to update the connection strengths in the neuronal simulation model. The updates happen at the
same timescale as the batch sizes used for evaluation.

5.2 Comparison of algorithms 

In Fig. 5 we compare the five algorithms, Alg 1 through Alg 5, that report the frequent episodes over
the window looking at one batch at a time, with the baseline algorithm Alg 0 that stores and processes
the entire window at each window slide. The results are averaged over all 9 data sets shown in Table 2.
By averaging we expect to marginalize the data characteristics and give a more general picture of each algorithm. The
parameter settings for the experiments are shown in Table 3. Fig. 5(a) plots the precision of the output of
each algorithm compared to that of Alg 0 (treated as ground truth). Similarly, Fig. 5(b) shows the recall.
Since the size of the output of each algorithm is roughly $k$, the corresponding precision and recall numbers are
almost the same. Average runtimes are shown in Fig. 5(c) and average memory requirement in MB is shown
in Fig. 5(d).
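For concreteness, precision and recall against the baseline output can be computed as in the following sketch. The function name and the toy episode sets are illustrative assumptions, not results from the experiments.

```python
def precision_recall(reported, ground_truth):
    """Precision and recall of a reported pattern set against a reference set."""
    reported, ground_truth = set(reported), set(ground_truth)
    hits = len(reported & ground_truth)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Toy example with k = 4: both sets have size k, three episodes in common.
baseline_top_k = {"ABCD", "BCDE", "CDEF", "DEFG"}   # output of the baseline (ground truth)
streaming_top_k = {"ABCD", "BCDE", "CDEF", "WXYZ"}  # output of a streaming variant
precision, recall = precision_recall(streaming_top_k, baseline_top_k)
```

When both sets contain exactly $k$ patterns the two denominators coincide, which is why the precision and recall numbers are nearly identical whenever an algorithm reports roughly $k$ patterns.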



Table 3: Parameter settings

Parameter                                   Value(s)
Batch size $T_b$                            $10^5$ sec (≈ 1 million events per batch)
Number of batches in a window $m$           10 (5, 15)
$v$, in $(v, k)$-persistence                0.5m, 0.75m
$k$ in $(v, k)$-persistence and in top-$k$  25, 50
$\ell$, size of episode                     4



We consistently observe that Alg 1 and 2 give lower precision and recall values compared with any other
algorithm. This reinforces our observation that the top-$k$ patterns in a window can be much different from
the top-$k$ patterns in the constituent batches. Alg 2 provides only a slight improvement over Alg 1 by
tracking an episode over subsequent batches once it enters the top-$k$. This improvement can be attributed
to the fact that the window frequencies of patterns that were once in the top-$k$ are better estimated. Alg 3 gives
higher precision and recall compared to Alg 1 and 2. The frequency threshold used in Alg 3 is given by
Theorem 1. Using this result we are able to estimate the value $f_k^s - 2\delta$ by simply counting the episodes that
are frequent in the previous batch. This avoids the multiple iterations required in general for finding the top-$k$
patterns. Fortunately this threshold also results in significantly higher support and precision.

Figure 5: Comparison of average performance of different streaming episode mining algorithms. Alg 1 and 2
give lower precision and recall values compared with any other algorithm. Overall the proposed methods
give at least one order of magnitude improvement over the baseline algorithm (Alg 0) in terms of both time
and space complexity.

We ran Alg 4 and 5 for two different values of $v$, viz. $v = m/2$ and $v = 3m/4$. Both these algorithms
guarantee finding all $(v, k)$-persistent patterns for the respective values of $v$. For the sake of comparison we also
include episodes that exceeded the frequency threshold prescribed by $(v, k)$-persistence, but do not appear
in the top-$k$ of at least $v$ batches. We observe that the precision and recall improve a little over Alg 3 with
a reasonable increase in memory and runtime requirements. For a higher value of $v$, the algorithm insists
that the frequent patterns must persist over more batches. This raises the support threshold and as a result
there is an improvement in terms of memory and runtime, but a small loss in precision and recall. Note that
the patterns missed by the algorithm are either not $(v, k)$-persistent or our estimation of $\delta$ has some errors
in it. In addition, Alg 5 gives a slight improvement over Alg 4. This shows that our heuristic threshold is
effective.

Overall the proposed methods give at least one order of magnitude improvement over the baseline
algorithm in terms of both time and space complexity.

5.2.1 Performance over time 

Next we consider one dataset, A2, with number of event types = 1000, average resting firing rate of 10
Hz, and number of embedded patterns = 50. On this data we show how the performance of the five
algorithms changes over time. The window size is set to $m = 10$ and the batch size to $10^5$ sec. Fig. 6 shows
the comparison over 50 contiguous batches. Fig. 6(a) and (b) show the way precision and recall evolve over
time. Fig. 6(c) and (d) show the matching memory usage and runtimes.

The data generation model allows the episode frequencies to change slowly over time. In the dataset
used in the comparison we change the frequencies of embedded episodes at two time intervals: batch 15 to
20 and batch 35 to 42. In Fig. 6(a) and (b), we have a special plot shown by the dashed line. This is
listed as Alg 0 in the legend. What this line shows is the comparison of top-$k$ episodes between consecutive
window slides. In other words the top-$k$ episodes in window $W_{s-1}$ are considered as the predicted output to
obtain the precision and recall for $W_s$. The purpose of this curve is to show how the true top-$k$ set changes
with time and how well the proposed algorithms track this change.

Alg 1 and 2 perform poorly. On average, in the transient regions (batch 15 to 20 and batch 35 to 42)
they perform 15 to 20% worse than any other method. Alg 3, 4 and 5 (for $v = 0.75m$ and $v = 0.5m$) perform
consistently above the reference curve of Alg 0. It is expected of any reasonable algorithm to do better than
the algorithm which uses the top-$k$ of $W_{s-1}$ to predict the top-$k$ of the window $W_s$. The precision and recall
performance are in the order Alg 3 < Alg 4 ($v = 0.75m$) < Alg 4 ($v = 0.5m$) < Alg 5 ($v = 0.75m$) < Alg 5 ($v = 0.5m$).
This is in the same order as the frequency thresholds used by each method, as expected.

In terms of runtime and memory usage, the changing top-$k$ does not affect these numbers. The lowest
runtimes are those of Alg 3. The initial slope in the runtimes and memory usage seen for Alg 0 is due to
the fact that the algorithm loads the entire window, one batch at a time, into memory. In this experiment
the window consists of $m = 10$ batches. Therefore only after the first 10 batches is one complete window span
available in memory.

5.2.2 Effect of Data Characteristics 

In this section we present results on synthetic data with different characteristics, namely, number of event 
types (or alphabet size), noise levels and number of patterns embedded in the data. 

In Fig. 7 we report the effect of alphabet size on the quality of results of the different algorithms. In datasets
A1, A2 and A3 the alphabet size, i.e. the number of distinct event types, is varied from 500 to 5000. We
observe that for smaller alphabet sizes the performance is better. Alg 1 and 2 perform consistently worse
than the other algorithms for different alphabet sizes.

In this experiment we find that the quality of results for the proposed algorithms is not very sensitive to
alphabet size. The precision and recall numbers drop by only 2-4%. This is quite different from the pattern
mining setting where the user provides a frequency threshold. In our experience alphabet size is critical in
the fixed frequency threshold based formulation. For low thresholds, large alphabet sizes can quickly lead to
uncontrolled growth in the number of candidates. In our formulation the support threshold is dynamically
readjusted and as a result the effect of large alphabet size is attenuated.

Figure 6: Comparison of the performance of different streaming episode mining algorithms over a sequence
of 50 batches (where each batch is $10^5$ sec wide and each window consists of 10 batches).

Next, in Fig. 8 we show the effect of noise. The average rate of firing of the noise event types (event
types that do not participate in pattern occurrences) is varied from 2.0 Hz to 25 Hz. The precision and
recall of Alg 1 and 2 degrade quickly with increase in noise. A small decrease in precision and recall of Alg
3 and Alg 4 is seen. But the performance of Alg 5 (for both $v = 0.75m$ and $v = 0.5m$) stays almost at the
same level. It seems that the frequency threshold generated by Alg 5 is sufficiently low to find the correct
patterns even at higher noise levels, but not so low as to require significantly more memory (≈ 400 MB) or
runtime (≈ 70 sec per batch at noise = 25.0 Hz for $v = 0.5m$) as compared to the other algorithms.

In Fig. 9 we change the number of patterns embedded in the data and study the effect. The number of
embedded patterns varies from 10 to 50. Once again the performance of our proposed methods is fairly flat
in all the metrics. Alg 3, 4 and 5 are seen to be less sensitive to the number of patterns embedded in the
data than Alg 1 and 2.

Figure 7: Effect of alphabet size. The proposed algorithms are robust to large alphabet sizes. Precision and
recall drop by only 2-4% going from alphabet size of 500 to 5000.



5.2.3 Effect of Parameters 

So far the discussion has been about the effect of data characteristics of the synthetic data. The parameters
of the mining algorithms were kept fixed. In this section we look at two important parameters of the
algorithms, namely, the batch size $T_b$ and the number of batches that make up a window, $m$.

In Fig. 10, the quality and performance metrics are plotted for three different batch sizes: $10^3$, $10^4$ and
$10^5$ (in sec). Batch size appears to have a significant effect on precision and recall. There is a 10% decrease
in both precision and recall when the batch size is reduced to $10^3$ sec from $10^5$. But note that a 100-fold decrease
in batch size only changes the quality of results by 10%.

It is not hard to imagine that for smaller batch sizes the episode statistics can have higher variability
across batches, resulting in lower precision and recall over the window. As the size of the batch grows,
the top-$k$ in the batch starts to resemble the top-$k$ of the window. Transient patterns will not be able to
gather sufficient support in a large batch.

As expected, the runtimes and memory usage are directly proportional to the batch size in all cases.
The extra space and time is required only to handle more data. Batch size does not play a role in the growth
of the number of candidates in the mining process.



Next, in Fig. 11, we show how the number of batches in a window affects the performance. Precision
and recall are observed to decrease linearly with the number of batches in a window in Fig. 11(a) and (b),
whereas the memory and runtime requirements grow linearly with the number of batches. The choice of the
number of batches provides the trade-off between the window size over which the user desires the frequent
persistent patterns and the accuracy of the results. For larger window sizes the quality of the results will
be poorer. Note that the memory usage and runtime do not increase much for algorithms other than Alg 0
(see Fig. 11(c) and (d)), because these algorithms only process one batch of data irrespective of the
number of batches in the window, although for larger windows the batch-wise support threshold decreases. In
the synthetic datasets we see that this does not lead to an unprecedented increase in the number of candidates.

Figure 8: Effect of noise. In terms of precision and recall Alg 5 is most robust to noise in the data.



5.3 Multi-neuronal Data 

Multi-electrode arrays provide high-throughput recordings of the spiking activity in neuronal tissue and
are hence rich sources of event data where events correspond to specific neurons being activated. We used
the data from dissociated cortical cultures gathered by Steve Potter's laboratory at Georgia Tech [15] over
several days. This is a rich collection of recordings from a 64-electrode MEA setup.

We show the result of mining frequent episodes in the data collected over several days from Culture
6 [15]. We use a batch size of 150 sec and all other parameters for mining are the same as those used for the
synthetic data. The plots in Fig. 12 show the performance of the different algorithms as time progresses.
Alg 1 and 2 give very low precision values, which implies that the top-$k$ in a batch is much different from
the top-$k$ in the window. Alg 3, 4 and 5 perform equally well over the MEA data, with Alg 3 giving the best
runtime performance.

At times Alg 3 requires slightly higher memory than the other algorithms (Alg 4 and 5). This may seem
counterintuitive as Alg 4 and 5 use a lower frequency threshold. But since $\delta$ is dynamically estimated from
all episodes being tracked by the algorithm, it can easily be the case that the $\delta$ estimates made by Alg 3 are
looser and hence result in higher memory usage.



6 Related work 

Most prior work in streaming pattern mining is related to frequent itemsets and sequential patterns [6 ; , 5J El [3] . 
Some interesting algorithms have also been proposed for streaming motif mining in time-series data [11].
But these methods do not easily extend to other pattern classes like episodes and partial orders. To our
knowledge, there has been very little work in the area of mining patterns in discrete event streams. In this
section we discuss some of the existing methods for itemsets, sequential patterns, and motifs.

Figure 9: Effect of number of embedded patterns.

Figure 10: Effect of batch size. Larger batch sizes have higher precision and recall. Precision and recall
increase logarithmically with batch size. (Note that the x-axis is in log scale.)

Karp et al. proposed a one-pass streaming algorithm for finding frequent events in an item sequence [6].
The algorithm, at any given time, maintains a set $K$ of event types and their corresponding counts. Initially,
this set is empty. When an event is read from the input sequence, if the event type exists in the set then
its count is incremented. Otherwise the event type is inserted into the set $K$ with count 1. When the size
of the set $K$ exceeds $\lfloor 1/\theta \rfloor$, the count of each event type in the set is decremented by 1 (and the event type is
deleted from the set if its count drops to zero). The key property is that any event type that occurs at least $n\theta$ times in the
sequence is in the set $K$. Consider an event type that occurs $f$ times in the sequence, but is not in $K$. Each
occurrence of this event type is eliminated together with more than $\lfloor 1/\theta \rfloor - 1$ occurrences of other event
types, achieved by decrementing all counts by 1. Thus, at least a total of $f/\theta$ elements are eliminated. Thus
$f/\theta < n$, where $n$ is the number of events in the sequence, and hence $f < n\theta$. This method guarantees no
false negatives for a given support threshold. But the space and time complexity of this algorithm varies
inversely with the support threshold chosen by the user. This can be a problem when operating at low
support thresholds. In [5], this approach was extended to mine frequent itemsets.
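The counting scheme described above can be sketched as follows. This is a minimal illustration written from the description in the text; the function name `frequent_candidates` and the toy event sequence are assumptions.

```python
import math

def frequent_candidates(events, theta):
    """One-pass candidate set for event types occurring at least n*theta times.

    Returns a dict of surviving event types and their (reduced) counts.
    The dict may also contain infrequent event types (false positives).
    """
    capacity = math.floor(1 / theta)
    counts = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
        if len(counts) > capacity:
            # Decrement every count by 1 and drop the ones that reach zero.
            counts = {e: c - 1 for e, c in counts.items() if c > 1}
    return counts

# Toy sequence of n = 12 events; "A" occurs 6 times, every other type at most twice.
events = list("ABACADAEAFAB")
survivors = frequent_candidates(events, theta=0.25)   # n * theta = 3
```

In this toy run "A" survives, as it must, but so do "F" and "B", which occur only once or twice. The surviving set has no false negatives but can contain false positives, so exact frequencies require a second pass or additional bookkeeping.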

Lossy counting constitutes another important class of streaming algorithms, proposed by Manku and
Motwani in 2002 [8]. In this work an approximate counting algorithm for itemsets is described. The algorithm
stores a list of tuples which comprise an item or itemset, a lower bound on its count, and a maximum error
term ($\Delta$). When processing the $i$-th item, if it is currently stored then its count is incremented by one;
otherwise, a new tuple is created with the lower bound set to one, and $\Delta$ set to $\lfloor i\epsilon \rfloor$. Periodically, each
tuple whose upper bound is less than $\lfloor i\epsilon \rfloor$ is deleted. This technique guarantees that the percentage error in
reported counts is no more than $\epsilon$ and it is also shown that the space used by this algorithm is $O(\frac{1}{\epsilon} \log \epsilon n)$ for
itemsets. Unfortunately, this method requires operating at a very low support threshold $\epsilon$ in order to provide
small enough error bounds. In jTU], the pattern growth algorithm - PrefixSpan [T3] for mining sequential
patterns was extended to incorporate the idea of lossy counting.

Figure 12: Comparison of performance of different algorithms on real multi-neuronal data.
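The lossy counting scheme discussed above can be sketched for single items as follows. The sketch uses the bucket formulation of Manku and Motwani (bucket width $\lceil 1/\epsilon \rceil$, error term set to the current bucket id minus one, pruning at bucket boundaries), which differs slightly in notation from the summary given in the text. The function name and the toy stream are assumptions.

```python
import math

def lossy_count(stream, epsilon):
    """Approximate item counts with at most epsilon * n undercount per item.

    Returns a dict item -> [count, max_error], where count is a lower bound
    on the true frequency and count + max_error is an upper bound.
    """
    width = math.ceil(1 / epsilon)          # bucket width
    entries = {}
    for i, item in enumerate(stream, start=1):
        bucket = math.ceil(i / width)       # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, bucket - 1]
        if i % width == 0:
            # Bucket boundary: drop entries whose upper bound is at most the bucket id.
            entries = {x: e for x, e in entries.items() if e[0] + e[1] > bucket}
    return entries

# Toy stream of 12 items; "A" occurs 6 times, every other item once.
stream = list("ABACADAEAFAG")
summary = lossy_count(stream, epsilon=0.25)
```

In this toy stream every item other than "A" is pruned at the end of its bucket, and "A" is reported with its exact count because it was inserted during the first bucket, where the error term is zero.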

In [3], the authors propose a new frequency measure for itemsets over data streams. The frequency of an 
itemset in a stream is defined as its maximal frequency over all windows in the stream from any point in the 
past until the current time that satisfy a minimal length constraint. They present an incremental algorithm 
that produces the current frequencies of all frequent itemsets in the data stream. The focus of this work is 
on the new frequency measure and its unique properties. 

In [11] an online algorithm for mining time series motifs was proposed. The algorithm uses an interesting 
data structure to find a pair of approximately repeating subsequences in a window. The Euclidean distance 
measure is used to measure the similarity of the motif sequences in the window. Unfortunately this notion 
does not extend naturally to discrete patterns. Further, this motif mining formulation does not explicitly 
make use of a support or frequency threshold and returns exactly one pair of motifs that are found to be 
the closest in terms of distance.

A particular sore point in pattern mining is coming up with a frequency threshold for the mining process.
The choice of this parameter is key to the success of any effective strategy for pruning the exponential search
space of patterns. Mining the top-$k$ most frequent patterns has been proposed in the literature as a more
intuitive formulation for the end user. In [13] we proposed an information theoretic principle for determining
the frequency threshold that is ultimately used in learning a dynamic Bayesian network model for the data.
In both cases the idea is to mine patterns at the highest possible support threshold to output either the
top-$k$ patterns or patterns that satisfy a minimum mutual information criterion. This is different from the
approach adopted, for example, in lossy counting, where the mining algorithm operates at a support threshold
proportional to the error bound. Therefore, in order to guarantee low errors, the algorithm tries to operate
at the lowest possible threshold.

An episode or a general partial order pattern can be thought of as a generalization of itemsets where 
each item in the set is not confined to occur within the same transaction (i.e. at the same time tick) and 
there is additional structure in the form of ordering of events or items. In serial episodes, events must occur 
in exactly one particular order. Partial order patterns allow multiple orderings. In addition there could be 
repeated event types in an episode. The loosely coupled structure of events in an episode results in narrower 
separation between the frequencies of true and noisy patterns (i.e. resulting from random co-occurrences of 
events) and quickly leads to combinatorial explosion of candidates when mining at low frequency thresholds. 
Most of the itemset literature does not deal with the problem of candidate generation. The focus is on 
counting and not so much on efficient candidate generation schemes. In this work we explore ways of doing 
both counting and candidate generation efficiently. Our goal is to devise algorithms that can operate at as 
high frequency thresholds as possible and yet give certain guarantees about the output patterns. 

7 Conclusions 

In this paper, we have studied the problem of mining frequent episodes over changing data streams. In
particular, our contribution in this work is threefold. We unearth an interesting aspect of temporal data
mining where the data owner may desire results over a span of time in the data that cannot fit in
memory or be processed at a rate faster than the data generation rate. We have proposed a new sliding
window model which slides forward in hops of batches. At any point only one batch of data is available for
processing. We have studied this problem and identified the theoretical guarantees one can give and the
necessary assumptions for supporting them.

In many real applications we find the need for characterizing patterns not just based on their frequency but
also on their tendency to persist over time. In particular, in neuroscience, the network structure underlying an
ensemble of neurons changes much more slowly in comparison to the culture-wide periodic bursting phenomenon.
Thus separating the persistent patterns from the bursty ones can give us more insight into the underlying
connectivity map of the network. We have proposed the notion of $(v, k)$-persistent patterns to address this
problem and outlined methods to mine all $(v, k)$-persistent patterns in the data. Finally we have provided
detailed experimental results on both synthetic and real data to show the advantages of the proposed
methods.

Finally, we reiterate that although we have focused on episodes, the ideas presented in this paper could 
be applied to other pattern classes with similar considerations.